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Abstract 



We generalize stochastic subgradient descent methods to situations in which we do not receive 
independent samples from the distribution over which we optimize, instead receiving samples 
coupled over time. We show that as long as the source of randomness is suitably ergodic — it con- 
verges quickly enough to a stationary distribution — the method enjoys strong convergence guar- 
antees, both in expectation and with high probability. This result has implications for stochastic 
optimization in high-dimensional spaces, peer-to-peer distributed optimization schemes, deci- 
sion problems with dependent data, and stochastic optimization problems over combinatorial 
' spaces. 



1 Introduction 

. In this paper, we analyze a new algorithm, Ergodic Mirror Descent, for solving a class of stochastic 

I I optimization problems. We begin with a statement of the problem. Let {F{-;^),^ S H} be a 

OO ' collection of closed convex functions whose domains contain the common closed convex set X C W^. 

■ Let n be a probability distribution over the statistical sample space H and consider the convex 

• ! function / : — >• M defined by the expectation 

o: r 

fix) := En[F(x; 0] = F{x; ^)dUiO. (1) 
We study algorithms for solving the following problem: 

>< 

H ' minimize f{x) subject to x £ X. (2) 

CZ • ^ 

A wide variety of stochastic optimization methods for solving the problem ([2]) have been explored 
in an extensive literature [351 [351 [23 [251 [30] ■ We study procedures that do not assume it is possible 
to receive samples from the distribution H, instead receiving samples ^ from a stochastic process P 
indexed by time t, where the stochastic process P converges to the stationary distribution U. This 
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is a natural relaxation, because in many circumstances the distribution 11 is not even known — for 
example in statistical applications — and we cannot receive independent samples. In other scenarios, 
it may be hard to even draw samples from 11 efficiently, such as when S is a high-dimensional or 
combinatorial space, but it is possible [19] to design Markov chains that converge to the distribution 
n. Further, in computational applications, it is often unrealistic to assume that one actually has 
access to a source of independent randomness, so studying the effect of correlation is natural and 
important [T7] . 

Our approach to solving the problem ([2]) is related to classical stochastic gradient descent 
algorithms |35[ I32j . where one assumes access to samples ^ from the distribution 11 and performs 
gradient updates using VF{x; ^). When 11 is concentrated on a set of n points and the functions F 
are not necessarily differentiable, the incremental subgradient method of Nedic and Bertsekas [29j 
applies, and the objective is of the form /(x) = ^ '^7=1 fii^)- More generally, our problem belongs 
to the family of stochastic problems with exogenous correlated noise [22] where the goal is to 
minimize En[-F(x;^)] as in the objective ([2]), but we have access only to samples ^ that are not 
independent over time. Certainly a number of researchers in control, optimization, stochastic 
approximation, and statistics have studied settings where stochastic data is not i.i.d. (see, for 
example, the books \22\ I38j and the numerous references therein). Nonetheless, classical results in 
this setting are asymptotic in nature and generally do not provide finite sample or high-probability 
convergence guarantees; our work provides such results. 

Our method borrows from standard stochastic subgradient and stochastic mirror descent method- 
ology [3H [30] , but we generalize this work in that we receive samples not from the distribution 11 but 
from an ergodic process ^1,^2) • • • converging to the stationary distribution 11. In spite of the new 
setting, we do not modify standard stochastic subgradient algorithms; our algorithm receives sam- 
ples and takes mirror descent steps with respect to the subgradients of F(x;^t). Consequently, 
our approach generalizes several recent works on stochastic and non-stochastic optimization, in- 
cluding the randomized incremental subgradient method [29] as well as the Markov incremental 
subgradient method |20[ 134] . There are a number of applications of this work: in control problems, 
data is often coupled over time or may come from an autoregressive process [22j; in distributed 
sensor networks [23j, a set of wireless sensors attempt to minimize an objective corresponding to 
a sequence of correlated measurements; and in statistical problems, data comes from an unknown 
distribution and may be dependent |42]. See our examples and experiments in § [Hand § [5l as well 
as the examples in the paper by Ram et al. [34], for other motivating applications. 

The main result of this paper is that performing stochastic gradient or mirror descent steps 
as described in the previous paragraph is a provably convergent optimization procedure. The 
convergence is governed by problem-dependent quantities (namely the radius of X and the Lipschitz 
constant of the functions F) familiar from previous results on stochastic methods [29\ \^3\ [30] and 
also depends on the rate at which the stochastic process ^i,^2,--- converges to its stationary 
distribution. Our three main convergence theorems characterize the convergence rate of Ergodic 
Mirror Descent in terms of the mixing tini6 T^ix 

(the time it takes the process to converge to the 
stationary distribution 11, in a sense we make precise later) in expectation, with high probability, 
and when the mixing times of the process are themselves random. In particular, we show that 
this rate is O (y^^^) for a large class of ergodic processes, both in expectation and with high 
probability. We also give a lower bound that shows that our results are tight: they cannot (in 
general) be improved by more than numerical constants. 

The remainder of the paper is organized as follows. Section [2] contains our main assumptions 
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and a description of the algorithm. Following that, we collect our main technical results in § [3j 
We expand on these results in example corollaries throughout § [H and give numerical simulations 
exploring our algorithms in § [5l We provide complete proofs of all our results in § [6] and the 
appendices. 

Notation For the reader's convenience, we collect our (standard) notation here. A function / is 
G-Lipschitz with respect to a norm ||-|| if \f{x) — f{y)\ < G ||x — The dual norm ||-||^ to a norm 
||-|| is defined by ||z||^ := supj|^.j|<;^ {z,x). A function ^p is strongly convex with respect to the norm 
II -11 over the domain X if 

1 2 

V'(y) ^ ip{x) + {\/'tp{x),y — x) + - ||x — y\\ for x,y ^ X. 

For a convex function /, we let df{x) = {g \ f{y) > f{x) + {g,y — x)} denote its subdifferen- 
tial. For a matrix A G M"^"*, we let Pi{A) denote its ith largest singular value, and when A £ M"^" 
is symmetric we let Xi{A) denote its ith largest eigenvalue. The all-ones vector is 1, and we denote 
the transpose of the matrix A by A~^ . We let [n] denote the set {1, . . . ,n}. For functions / and 5, 
we write f{n) = 0{g{n)) if there exist N < 00 and G < 00 such that f{n) < Cg{n) for n > N, 
and /(n) = Q{g{n)) if there exist N < 00 and c > such that f{n) > cg{n) for n > N. For a 
probability measure P and measurable set or event A, P{A) denotes the mass P assigns A. 

2 Assumptions and algorithm 

We now turn to describing our algorithm and the assumptions underlying it. We begin with a de- 
scription of the algorithm, which is familiar from the literature on mirror descent algorithms |31l [3] . 
Specifically, we generalize the stochastic mirror descent algorithm |3H 130]. which in turn general- 
izes gradient descent to elegantly address non-Euclidean geometry. The algorithm is based on a 
prox-function ip, a differentiable convex function defined on X assumed (w.l.o.g. by scaling) to be 
1-strongly convex with respect to the norm ||-|| over X. The Bregman divergence generated by 
ip is defined as 

D^{x, y) := i;{x) - \lj{y) - (VV'(y), x - y) > - ||x - yf . (3) 
We assume X is compact and that there exists a radius R < 00 such that 

D^{x,y)<^R^ foTX,yeX. (4) 

The Ergodic Mirror Descent (EMD) algorithm is an iterative algorithm that maintains a param- 
eter x{t) G X, which it updates using stochastic gradient information to form x{t + 1). Specifically, 
let denote the distribution of the stochastic process P at time t. We assume that we receive a 
sample ~ P* at each time step t. Given ^t, EMD computes the update 

g{t) £dF{xity,Ct), x{t + l) = argmm\{g{t),x) + ^D^{x,x{t))\. (5) 

The initial point x{l) may be selected arbitrarily in X, and here a{t) is a non-increasing (time- 
dependent) stepsize. The algorithm ^ reduces to projected gradient descent with the choice 
ip{x) = ^ ||x||2, since then D^{x, y) = ^ \\x — y\\2- 
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Our main assumption on the functions F{-;S^) regards their continuity and subdifferentiabiHty 
properties, though we require a bit more notation. Let G(x;^) G dF{x;S,) denote a fixed and 
measurable element of the subgradient of evaluated at the point x, where (without loss of 

generality) we assume that in the EMD algorithm ([5]) we have g{t) = G{x{t);^t)- We let J-'t denote 
the (T- field of the first t random samples ^i, . . . , from the stochastic process P (that is, is drawn 
according to P*). We make one of the following two assumptions, where in each the norm ||-|| is 
the norm with respect to which ip is strongly convex ([3]): 

Assumption A (Finite single-step variance). Let x be measurable with respect to the a-field Tt-i- 
There exists a constant G < oo such that with probability 1 

nmx;it)\\i\j't-i]<G\ 

Assumption B. For H-almost every ^, the functions F{-;S^) are G-Lipschitz continuous functions 
with respect to a norm \\-\\ over X . That is, 

\F{x; - F{y; ^)\<G\\x- y\\ for x,yeX. 

As a consequence of Assumption iB] for any g G dF{x;^) we have that H^H^ < G (e.g., [IS]), and 
it is clear that the expected function / is also G-Lipschitz. Assumption [B] implies Assumption \A\ 
though Assumption El still guarantees / is G-Lipschitz, and under either assumption, we have 

< G^. (6) 

Having described the family of functions {-F(-;^) : ^ G H}, we recall a few definitions from 
probability theory that are essential to the presentation of our results. We measure the convergence 
of the stochastic process P using one of two common statistical distances [L3\: the Hellinger distance 
and the total variation distance (our definitions are a factor of 2 different from some definitions of 
these metrics). The total variation distance between probability distributions P and Q defined on 
a set S, assumed to have densities p and q with respect to an underlying measure /i0 is 

d^AP,Q) ■■= I - qiOmO = 2 sup |P(^) - Q{A)\, (7) 
Js Age 

the supremum taken over measurable subsets of H. The squared Hellinger distance is 

dUP,Qf = ^ - ^iOMO = ^ {Vm-^M)yd^^{0. (8) 

It is a well-known fact [13] that for any probability distributions P and Q, 

dheiiP, Q? < c^Tv (P^ Q) < 24ci(^', Q)- (9) 

Using the total variation (j7]) and Hellinger ([8]) metrics, we now describe our notion of mixing 
(convergence) of the stochastic process P. Recall our definition of the c-field Tt = o"(^i, . . . ,^t)- 
Let denote the distribution of conditioned on (i.e. given the initial samples ^i, . . . , ^g), so 
for measurable j4 C H we have Pj*^j(j4) := i-*(^t £ ^ I Fg). We measure convergence of P to H in 
terms of the mixing time of the different P^^^ , defined for the Hellinger and total variation distances 
as follows. In the definitions, let pj^j and n denote the densities of and H, respectively. 

^This is no loss of generality, since P and Q are absolutely continuous witli respect to P + Q. 



E 



|G(x;e)ii: 



E 



E 



\Gix;m:\^i 



t-1 



4 



Definition 1. The total variation mixing time T^^(P[^],e) of the sampling distribution P condi- 
tioned on the a-field of the initial s samples J-g = ct(^i, . . . , is the smallest f £ N such that 
d,,(P[^+*,n)<e, 

r^^(P[,],e) := inf |i - s : t G N, £ \p\^^{i) - vr(e)| < e} • 

The Hellinger mixing time Thei(-P[s]) e) is the smallest t such that dhci(P['^]^*, H) < e, 

rUP[s],e) :=inf|t-s:tGN, ^ (^^^ - 7^)' dMO < e'} • 

Put another way, the mixing times r^y(P[s],e) and Thei(-P[s], e) are the number of additional 
steps required until the distribution of is close to the stationary distribution 11 given the initial 
s samples ^i, . . . , 

The following assumption, which makes the mixing times of the stochastic process P uniform, 
is our main probabilistic assumption. 

Assumption C. The mixing times of the stochastic process {^j} are uniform in the sense that 
there exist uniform mixing times r^^ (P, e), rhci(-P, e) < oo such that with probability 1, 

{P, e) > -^Tv {P[s] , e) and rhci(P, e) > Thci(P[s] , e) 

for all e > and s G N. 

Assumption [C] is a weaker version of the common assumption of (/)-mixing in the probability 
literature (e.g. [9j); 0- mixing requires convergence of the process over the entire "future" cr- field 
a{^t,S,t+i, ■ ■ ■) of the process Any finite state-space time-homeogeneous Markov chain satisfies 
the above assumption, as do uniformly ergodic Markov chains on general state spaces |27j . 

We remark that the definition [1] of mixing time does not assume that the distributions 
are time- homogeneous. Indeed, Assumption [C] requires only that there exists a uniform upper 
bound on the mixing times. We can weaken Assumption [Cl to allow randomness in the probability 
distributions Pj*j themselves, that is, conditional on Ts, the mxing time T^^{P^g^,e) is an J^g- 
measurable random variable. Our weakened probabilistic assumption is 

Assumption D. The mixing times of the stochastic process {^j} are stochastically uniform in 
the sense that there exists a uniform mixing time r^y(P, e) < oo, continuous from the right as a 
function of e, such that for all e > 0, s G N, and c G M 

P{ttv{Pis],(^) > Ttv(^' e) + i^c) < exp(-c). 

Assumption [P] allows us to provide convergence guarantees for a much wider range of processes, 
such as auto-regressive processes, than permitted by Assumption O 



3 Main results 

With our assumptions in place, we can now give our main results. We begin with three general 
theorems that guarantee the convergence of the EMD algorithm in expectation and with high 
probability. The second part of the section shows that our analysis is sharp — unimprovable by 
more than numerical constant factors — by giving an information-theoretic lower bound on the 
convergence rate of any optimization procedure receiving non-i.i.d. samples from P. 
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3.1 Convergence guarantees 

Our first result gives convergence in expectation of tlie EMD algorithm ([5]) ; we provide the proof 



m 

Theorem 1. Let Assumption hold and let x{t) be defined by the EMD update ([5]) with non- 
increasing stepsize sequence {a{t)}. Let x* ^ X be arbitrary and let ([4]) hold. If Assumption HI 
holds, then for any e > 0, 

- T 

E 



t=i 



Y^{f(x{t))-f{x*)) 

+ — X] + ^TeGR + (Thei(P, e) - 1) ^ a{t) + RG 



2a{T) 



t=i 



t=i 



while if AssumptionlM holds, then for any e > 0, 

T 



E 



Y,ifixit))-fix*)) 



t=l 



- wtt + t ^ "^^^ + ^'^^ + ^^^^ ^) - 1) E + 

^'^4=1 L t=l 

T/ie expectation in both bounds is taken with respect to the samples .^i , . . . , • 

We obtain an immediate corollary to Theorem [1] by applying Jensen's inequality to the convex 
function /: 

Corollary 1. Define x{T) = y X^^i ^(0 ^'^'^ ^he conditions of Theorem U\ hold. If Assump- 
tion H] holds, then for any e > 



E[/(£(T)) - fix^)] < 



G' 



2 T 



2a{T)T 2T ^ 

If Assumption [Sl holds, then for any e > 

?2 n2 T 



rhei(P,e)-l 



E[/(x(T)) - /(x^)] < + %Y^ a{t) + eGi? + ^"^^^ ^ [g^ J] a(t) + i?G 



2a(T)r 2T ^ 



Corollary [T] shows that so long as the stepsize sequence a{t) is non- increasing and satisfies the 
asymptotic conditions Ta{T) — )• oo and {^/T)Y^^^^a{t) — >• 0, the EMD method converges. We 
can also provide similar high-probability convergence guarantees: 

Theorem 2. Let the conditions of TheoremUl and Assumption\Mhold. Let 5 £ (0, 1) and define the 
average x{T) = ^ Ylt=i x{t). With probability at least 1 — 6, for e > such that T^y{P,e) < T/2, 



f{x{T)) - fix*) < 



R^ 



2Ta{T) 2T ^ 
+ eGR + AGR 



G^ a{t) + GR 



t=i 



'r,,(P,e)log^^ 
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We provide the proof of this theorem in ^ 16.31 Note that the rate of convergence in Theorem [2] is 
identical to that obtained in Theorem [T] plus an additional term that arises as a result of the control 
of the deviation of the ergodic process around its expectation. The additional log ^-dependent 
term arises from the application of martingale concentration inequalities [2], which requires some 
care because the process {S^t} is coupled over time. Nonetheless, as we discuss briefly following 
Corollary [2] — and as made clear by our lower bound in Theorem H] — the additional terms introduce 
a factor of at most y^log r^y {P, e) to the bounds. That is, the dominant terms in the convergence 
rates (modulo logarithmic factors) also appear in the expected bounds in Theorem [TJ 

The last of our convergence theorems extends the previous two to the case when the stochastic 
process is not uniformly mixing, but has mixing properties that may depend on its state. We 
provide the proof of Theorem [3] in § 16. 4[ 

Theorem 3. Let the conditions of Theorem hold, except that we replace the uniform mixing 
assumption m with the probabilistic mixing assumption Wi Let 6 G (0,1). In the notation of As- 
sumption [Dl define 

r(e, 6) := r^^ {P, e) + k (^log ^ + 2 log(r) 
With probability at least 1 — 6, for any x* G X , 

f{x{T)) - fixn < inf { + g f: a{t) + ^^^^ [g^ «(^) + 



+ eGR + AGR\ 



r(e,5)log 



r(.,5) 



In § 14.31 we give two applications of Theorem [3] (to estimation in autoregressive processes and 
a fault-tolerant distributed optimization scheme) that show how it makes the applicability of our 
development substantially broader. 

We now turn to a slight specialization of our bounds to build intuition and attain a simplified 
statement of convergence rates. Theorems (TJ [2l and [3] hold for essentially any ergodic process that 
converges to the stationary distribution 11. For a large class of processes, the convergence of the 
distributions P* to the stationary distribution 11 is uniform and at a geometric rate |27j : there exist 
constants ki and K2 such that t^v(P, e) < Kilog(K2/e). We have the following corollary for this 
special case; we only present the version yielding expected convergence rates, as the high-probability 
corollary is similar. In addition, by the fact ([9]) relating dhel to d^^, if the process P satisfies 
T^y{P,e) < Kilog(K2/e), then there exist constants k[ and Kg such that rhei(-P, e) < k'^ log(K2/e). 
Thus we only state the corollary for total variation mixing and under Assumption [HI an analogous 
result holds under Assumption [Xl for mixing with respect to the Hellinger distance. 

Corollary 2. Under the conditions of TheoremUl assume in addition that Trj,^(P,e) < ki log(K2/e) 
and let Assumption\^ hold. The EMD update ([5]) with stepsize a{t) = aj^ft satisfies 

E [/(S,T,) - /(.-)] < + ^ ,og ^) + .OR + ^^^i!^. 



(10) 



Proof Using the definition a{t) = a/\/t and the integral bound 



1 

V ^ < 1 + / t'^/^dt = 2Vt - 1 < 2Vt, 
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we have Yl't=i '^(0 — 2a\/T'- The corollary now follows from Theorem [TJ □ 

We can obtain a simplified convergence rate with appropriate choice of the stepsize multiplier 
a and mixing parameter e: choosing a = R/ {G^J ni log(K2T)) and e = T~^/^ reduces the corollary 
to 

E[/(S(T)) -/(.•)] =of^^i^^#M)V (11) 



More generally, using the stepsize a{t) = a/^/t and the same argument as in Corollary [2] gives 



Again choosing e = T ^^"^ and defining the shorthand Tmix = Trj,y{P,T ^^'^), by choosing a = 
R/ (G^Tmix)) we see the bound implies that 

nmT)) - fix^] < ^ • ^ + ^ + "^^^"7"'^ - (13) 



In the classical setting [3Dj of i.i.d. samples ^ ~ 11, stochastic gradient descent and its mirror descent 
generalizations attain convergence rates of 0{RG/VT). Since T^y{P, 0) = Thei(i', 0) = 1 for an i.i.d. 
process, the rate p2|) shows that our results subsume existing results for i.i.d. noise. Moreover, they 
are sharp in the i.i.d. case, that is, unimprovable by more than a numerical constant factor [3H [T], 
In addition, we note that the conclusions of Corollary [2] (and the bound (I12p ) hold — modulo 
an additional log r^^ (P, e) — with high probability. We may also note that replacing eGR with 
3eGR and r^^ with Thei in the bound (fT2]) yields a guarantee under Assumption El Further, the 
step-size choice a{t) = aj^ft is robust — in a way similarly noted by Nemirovski et al. [30] — for 
quickly mixing ergodic processes. Indeed, using the inequalities (fT2]) and p^ . we see that setting 
the multiplier a = -iR/{Gy/¥~) yields E[/(x(r)) - f {x*)] = C'(max{7, 7-i}iiG/w/\/r), so 
mis-specification of a by a constant 7 leads to a penalty in convergence that scales at worst linearly 
in max{7~^,7}. In classical stochastic approximation settings [351 [381 [22] , one usually chooses step 
size sequence a{t) = 0{t~"^) for m G (.5, 1]; in our case, such choices may yield sub-optimal rates 
because we study convergence of the averaged parameter x{T) rather than the final parameter 
x{t). Nonetheless, averaging is known to yield robustness in i.i.d. settings [321 130] . and moreover 
gives unimprovable convergence rates in many cases (see § 13.21 as well as references [31[ [T]). We 
provide some evidence of this robustness in numerical simulations in § [5l and we see generally that 
EMD has qualitative convergence behavior similar to stochastic mirror descent for a broad class of 
ergodic processes. 

Before continuing, we make two final remarks. First, none of our main theorems assume Marko- 
vianity or even homogeneity of the stochastic process P; all that is needed is that the mixing time 
(or Thei) exists, or even that it exists only with some reasonably high probability. Previous 
work similar to ours [34[ |20] assumes Markovianity (see also our discussion concluding § 14. 3p . Fur- 
ther, general ergodic processes do not always enjoy the geometric mixing assumed in Corollary [21 
satisfying either Assumption [Dis probabilistic mixing condition or simply mixing more slowly. In 
§ 14. 3[ we present examples of such probabilistically mixing processes on general state spaces, while 
the bound (I12p suggests an approach to attain convergence for more slowly mixing processes (see 
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3.2 Lower bounds and optimality guarantees 



Our final main result concerns the optimaUty of the results we have presented. Informally, the 
theorem states that our results are unimprovable by more than numerical constant factors, though 
making this formal requires additional notation. In the stochastic gradient oracle model of convex 
optimization [31^ [1], a method Ai issues queries of the form x G to an oracle that returns 
noisy function and gradient information. In our setting, the oracle is represented by the pair 
9 = (P, G), and when the oracle is queried at a point x at time t (i.e., this is the tth. query 9 
has received), it draws a sample according to the distribution P(- | and returns 

Gix,Ct) e K'^- The method issues a sequence of queries x{l), . . . ,x{t) to the oracle and may use 
{G(a;(l), ^i), . . . , G(x(t),^t)} to devise a new query point x{t + 1). For an oracle 9, we define the 
error of the method on a function / after T queries of the oracle as 



where x denotes the method A^'s estimate of the minimizer of / after seeing the T samples 
{G{x{l),^i), . . . ,G{x{T),^t)}- The quantity is random, so we measure accuracy in terms 
of the expected value Ee[eT(A^, /, A", 6*)], where the expectation is taken with respect to the ran- 
domness in 9. 

Now we define a natural collection of stochastic oracles for our dependent setting. 

Definition 2. For f convex, r G N, G G (0,oo), and p G [l,oo], the admissible oracle set 
Q {f,T,G,p) is the set of oracles 6 = {P,G) for which there exists a probability distribution H 
on ^ such that 



The set (/, r, G, p) is the collection of oracles 9 = (P, G) for which the distribution P has 
stationary distribution 11, mixing time bounded by r, and returns ^p-norm bounded stochastic 



and [B] hold, while (Pj^j*"^ , IT) = satisfies Assumption [Cl With Definition [21 for any collection 
C of convex functions /, we can define the minimax error over distributions with mixing times 
bounded by r as 

eUC,X,T,G,p):= mi sup sup Kg[eT{M, f,X,9)] . (15) 

fee eee{f,T,G,p) 

We have the following theorem on this minimax error (see § 16.51 for a proof). 

Theorem 4. Let C R'^ 6e a convex set containing the £oo l^all of radius r for some r > 0. Let 
1/p+l/q = 1 andp > 1 and let the setC consist of convex functions that are G-Lipschitz continuous 
with respect to the iq-norm over the set X. For p G [1,2] and for any r G N, the minimax oracle 
complexity (fTSjl satisfies 



eTiMJ,X,9) = f{x)- inf /(x) 



(14) 



G{x;C)\\p <G forxeX andCeE, En[G{x;C)] G Of{x) for x e X 
and drj,^ (P*^^,!!) = for all t £ 'N with probability 1. 






(16a) 



For p G [2, oo] and for any r G N, the minimax oracle complexity (jlSp satisfies 




(16b) 
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We make a few brief comments on the implications of Theorem 01 First, the dependence on r 
and T in the bounds of sJt jT matches that of the upper bound ([13]) • In addition, fohowing the 
discussion of Agarwal et al. [H Section III. A and Appendix C], we can see that the dependence of 
the bounds (|16ap and (|16bp on the quantities r, G, and the dimension d are optimal (to within 
logarithmic factors). In brief, the bound (|16ap is achieved by taking ?/;(x) = \ \x\}^ in the definition 
of the proximal function for the EMD algorithm, while the bound ()16bp is achieved by taking 
V'(x) = \ ||a:||^ for g = 1 + l/log(d) (see also [H O Section 5]). Summarizing, we find that 
Theorems [iHS] are unimprovable by more than numerical constants, and the EMD algorithm ([5]) 
attains the minimax optimal rate of convergence. 



4 Examples and Consequences 

We now collect several consequences of the convergence rates of Theorems (H [2l and [3] to provide 
insight and illustrate applications of the theoretical statements. We begin with a concrete example 
and move toward more abstract principles, completing the section with finite sample and asymptotic 
convergence guarantees for more slowly mixing ergodic processes. Most of the results are new or 
improve over previously known bounds. 



4.1 Peer-to-peer optimization and Markov incremental gradient descent 

The Markov incremental gradient descent (MIGD) procedure due to Johansson et al. [20] is a 
generalization of Nedic and Bertsekas's randomized incremental subgradient method [29], which 
Ram et al. [34] further analyze. The motivation for MIGD comes from a distributed optimization 
algorithm using a simple (locally computable) peer-to-peer communication scheme. We assume 
we have n processors or computers, each with a convex function /j : — )• M, and the goal is to 
minimize 

1 

j(x) = — ^^/i(x) subject to X £ X. (17) 

1=1 

The procedure works as follows. The current set of parameters x{t) £ X is passed among the 
processors in the network, where a token i{t) G [n] indicates the processor holding x{t) at iteration 
t. At iteration t, the algorithm computes the update 



g{t) e dfi(t){x{t)), x{t + l) = argmin<^ {a{t),x) + ——D^{x,x{t)) 



a{t)' 



after which the token i(t) moves to a new processor. This update is a generalization of the pa- 
pers [20l E], which assume ^(x) = ^ H^Hg. Slightly more generally, the local functions may be 
defined as expectations, fi{x) = EnJ-F(a;; ^)], for a local distribution Ilj. At iteration t, a sample 
£,t,i{t) is drawn from the local distribution Ili{t) and the algorithm computes the update 

g{t) G aF(x(0;6,*(t)), x(t + l)=a.Tgmm\{g{t),x) + ^D^{x,x{t))] . (18) 

We view the token i{t) as evolving according to a Markov chain with doubly-stochastic transition 
matrix P, so its stationary distribution is the uniform distribution. In this case. 



P(i(t) = j I iit -l)=i)=Pi 
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The total variation distance of the stochastic process initiaUzed at i{0) = i from the true (uniform) 
distribution is — IL/n||^, where denotes the ith standard basis vector. In addition, since P 

is doubly stochastic, we have PI = 1 and thus 

\\P^ei - l/n\\^ < V^\\P*ei - l/n\\^ = ^\\p\ei - < V^P2{PY \\ei - l/nW^ < ^p2{P)\ 

where P2{P) denotes the second singular value of the matrix P. From this spectral bound on the 
total variation distance, we see that \i t > 2 we have ||P*e,- — l/nll, < In addition, 

' — logp2(-P) ^ II ' '111 — ' 

recalling the sandwich inequalities ([9]), we have 



dhei(P*ei,l/n) < Jd^^(P*ei,l/n) < n^'^p2{P) 



t/2 



SO dheiiP^Gi, 1/n) < 1/VT when t > iJ°p^\^pyi ■ In the notation of Assumption [Cl 

^ .p^-i/2N^ log(Tn) log(rn) AP T'^'^^ < ^°g(^") (^^\ 

^^^^ ' 2\ogp2{P)-^- 2{l-p2{P)) ^^^'^^'^ ^-l-p2{Py ^^^^ 

(Since log « 1 — /? for p 1, using 1 — p is no significant loss in our applications.) Consequently, 
we have the following result, similar to Corollary [2j 

Corollary 3. Let x{t) evolve according to the Markov incremental descent update (jlSp . where i{t) 
evolves via the doubly stochastic transition matrix P and a{t) = aj^ft. Define x{T) = ^ Ylt=i ^(^) 
and Tmix = \/log(Tn) j \J\ — p2 (P) . Choose stepsize multiplier a = R/G^/t^. If for each distri- 
bution Hi we have EnJ||G(x;^)||^] < , then 

E[/(x(r))] - fixn < ^ . + ^ + :^ . (20) 

Let 6 G (0, 1) and assume Tmix If for each i and Hi- almost every ^ we have ||G(x; ^)||^ < G, 

then with probability at least 1 — 6 



f[x{T)) - f[X ) < — — • -- H ^ + ■ Tmix H Tmix log 



mix 



Proof The proof is a consequence of Theorems [T] and [2] and Corollary [2j We use the uniform 
bound (|19p on the mixing time of the random walk, in Hellinger or total variation distance, and 
the result follows via algebra. □ 



Corollary [3] gives convergence rates sharper and somewhat more powerful than those in the 
original Markov incremental gradient descent papers \20\ 134] . First, our results allow us to use 
mirror descent updates, thus applying to problems having non-Euclidean geometry; it is by now well 
known that this is essential for obtaining efficient methods for high-dimensional problems |3HB1[3]. 
Secondly, because we base our convergence analysis on mixing time rather than return times, we 
can give sharp high-probability convergence guarantees. Finally, our convergence rates are often 
tighter. Ram et al. [34] do not appear to give finite sample convergence rates, and as discussed by 
Duchi et al. [H], Johansson et al. [20] show that MIGD — with optimal choice of their algorithm 
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parameters — has convergence rate 0{RGmaxi ^J ^^), where T is the return time matrix given by 

r = (/ — P + 11^ /n)^^. When P is symmetric (as in \20\ Lemma 1]), the eigenvalues of T are 1 
and 1/(1 - Xi{P)) for i > 1, and 

" 1 1 
nmaxTii > tr(r) = 1 + "S^ — — > — . 

/L.1_A,(P) 1-P2{P) 

Thus, up to fogarithmic factors, the bound (j20p from Corollary[3]is never weaker. For weh-connected 
graphs, the bound is substantially stronger; for example, a random walk on an expander graph has 
constant spectral gap [10], so (1 — P2{P))^^ = 0{1), while the previous bound is nmaxjg[„] Fji = 
Q{n). 



4.2 Optimization over combinatorial spaces 

For our second example, we consider settings where H is a combinatorial space from which it is 
difficult to obtain uniform samples but for which we can construct a Markov chain that converges 
quickly to the uniform distribution over S. See Jerrum and Sinclair [19j for an overview of such 
problems. More concretely, consider the statistical problem of learning a ranking function for web 
searches. The statistician receives information in the form of a user's clicks on particular search 
results, which impose a partial order on the results (since only a few are clicked on). We would like 
the resulting ranking function to be oblivious to the order of the remaining results, which leads us 
to define H to be the set of all total orders of the search results consistent with the partial order 
imposed by the user. Certainly the set H is exponentially large; it is also challenging to draw a 
uniform sample from it. 

Though sampling is challenging, it is possible to develop a rapidly-mixing Markov chain whose 
stationary distribution is uniform on H. Specifically, Karzanov and Khachiyan [21] develop the 
following Markov chain. Let V he a partial order on the set [n], whose elements are of the form 
i -< j for i,j G [n]. The states of the Markov chain are permutations a of [n] respecting the partial 
order V, and the Markov chain transitions between permutations a and a' by randomly selecting a 
pair i,j G [n], then swapping their orders if this is consistent with the partial order V. Wilson |41] 
showed that the mixing time of this Markov chain is bounded by 

TTv(i^,e) < 4^'log-. (21) 

Similar results hold for sampling from other combinatorial spaces [19]. 

Theorem [2] gives the following consequence of the bound (j2ip on the mixing time of the 
Karzanov-Khachiyan Markov chain. Denote the set of permutations a consistent with the par- 
tial order "P by o" G "P, so the objective ([T]) has the form 

f{x) := y F{x;a). 

■' ^ ' card((7 E PW-^ ^ ^ 

We have 

Corollary 4. Let x{t) evolve according to the EMD update ([5]), where the sample space is the set of 
permutations {a} consistent with the partial order V over [n]. Define x{T) = ^Ylt=i^{'^)- Under 
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AssumptionWi with a{t) = aj \ft and the choice of multiplier a = tt 



..^.^NN *N 5Gi? n3/Vlog(Tn) RGn^logiTn) AGR ^^5- — — — — . — — -r- 
f{x{T)) - fix*) < — ^7^^ + 2T + TT ■ V^'l°g(^^)(log[W'^)log(^^)]) 

with probability at least 1 — 6, where 5 G (0, 1). 
4.3 Probabilistically mixing processes 

We now turn to two examples to show the broader appHcabihty of the EMD algorithm guaranteed 
by Theorem [3j Our first example generalizes the Markov incremental gradient method of § 14.11 to 
allow random communication matrices P, while our second considers optimization problems where 
the data comes from a (potentially nonlinear) autoregressive moving average (ARM A) process. For 
both examples, we require a conversion from expected convergence of the total variation distance 
(P^jj*"^ , n) as r — )• cxD to the probabilistic bound in Assumption [Dl To that end, we prove the 
following lemma in Appendix [Dl 

Lemma 5. Let E[d^y (Pj+^, H)] < Kp'^ for all r G N, where K > 1 and p e (0, 1). Define 

r^y(P, e):= , . ^ ^ , + rr^ — r +1 and 
For any e G (0, 1] and c G M, 



log p\ I log p\ 



log p\ 



P (^Tvl-^M'^) > r^^{P,e) + Kc) < exp(-c). 



We begin with the analysis of the random version of the Markov incremental gradient descent 
(MIGD) procedure. As before, a token i{t) moves among the processors in a network of n nodes, but 
now the transition matrix P governing the token is random. At time t, the transition probability 
P{i{t) = j I i(t — 1) = i) = Pij{t), where {P{t)} is an i.i.d. sequence of doubly stochastic matrices. 
Let A„ denote the probability simplex in and it(0) G A„ be arbitrary. Define the sequence 
u{t + 1) = P{t)u{t), so u{t) is the distribution of i{t) if the token has initial distribution u{0). As 
shown by Boyd et al. [7] and further studied by Duchi et al. pi], we obtain 



E[||n(t) 



l/r^llj <V^\\um\l\2{E[P{iyP{l)]Y < V^A2(E[P(l)Tp(l)])*. (22) 

Notably, with p = A2(E[P(1)^P(1)]) < 1 and K = \fn^ the estimate (I22p satisfies the conditions 
of Lemma[5l since (i^y(P*, 11) = ||ii(t) — Il/n||-^. Generally, E[P(1)'''P(1)] has much smaller second 
eigenvalue than any of the random matrices P(i) (indeed, it may be the case that A2(P(t)) = 1 
with probability 1, as in randomized gossip [7]). Using ([22]), if we define A2 = A2(E[P(1)"^P(1)]), 
we may take 

log^ 1 

-rTv(^,e) < T- and k < — 

i — A2 i — A2 

in Lemma m Applying Theorem [3] we obtain the following corollary. 
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Corollary 6. Let the conditions of Theorem\^hold, and in the notation of the previous paragraph, 
define X2 ■= A2(IE[i^(l)^-P(l)]). Fix 5 € (0,1]. With stepsize choice a{t) = a/y/t, there is a 
constant C < 4 such that with probability at least 1 — 6 



As an example of the applicability of this approach, suppose that in the network of communicat- 
ing agents used in MIGD, each communication link fails with a probability 7 G (0, 1), independently 
of the other links. Let P denote the transition matrix used by the MIGD algorithm without network 
failures. Then (under suitable conditions on the network topology; see [H] for details) 



Applying Corollary [6] and taking e = 1/T and 6 = 1/T^, we obtain (ignoring doubly logarithmic 
factors) that there is a universal constant C such that with probability at least 1 — T"^ 



Roughly, we see the intuitive result that as the failure probability 7 increases to 1, the convergence 
rate of the algorithm suffers; for 7 bounded away from 1, we suffer only constant factor losses over 
the rates in Corollary [31 

As another example of the applicability of Theorem [3l we look to problems where the statisti- 
cal sample space H is uncountable. In such scenarios, standard (finite-dimensional) Markov chain 
theory does not apply. Uncountable spaces commonly arise, for example, in physical simulations of 
natural phenomena or autoregressive processes [27], control problems [22], as well as in statistical 
learning applications, such as Monte Carlo-sampling based variants of the expectation maximiza- 
tion (EM) algorithm [IQ]. To apply results based on Assumption [Cl however, requires uniform 
ergodicity [271 Chapter 16] of the Markov chain. Uniform ergodicity is difficult to verify and often 
requires conditions essentially equivalent to compactness of H. 

Theorem [3] allows us to avoid such difficulties. For concreteness, we focus on autoregressive 
moving average (ARMA) processes, common models for control problems and statistical time series. 
In general, an ARMA process is defined by the recursion 



where ^ : M"' M"' and S : M"' R'^'"^ are measurable, the innovations Wt € M are i.i.d. and 
Cov{Wt) exists. When A{z) = Az, that is, A is identified with a matrix A G M*^^*^, and ^(z) 
is a constant matrix S, we recover the standard linear ARMA model. The convergence of such 
processes is area of recent research (e.g., [271 1211 121])! but we focus particularly on the paper of 
Liebscher [24j . As a consequence of Liebscher's Theorem 2, we obtain that if A(^) = A^^ + ^(0) 
where /i(^) = o(||^||) as ||^|| — )• 00, the matrix A satisfies pi{A) < 1, and S(^) = E is a fixed matrix, 
then there exist constants M > and p S (0, 1) such that for all t, r G N 




A2(E[P(1)'P(1)])<7+(1-7)A2(P). 




6+1 = A{^t) + mt)Wt 



(23) 



E (i^v(K+^,n) <M/j^ whenever E[||^o||] < 00. 
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Here 11 is the stationary distribution of the ARMA process (j23p . 

In particular, for any ARMA process (I23p satisfying the conditions, Lemma [5] guarantees that 
Assumption [D] holds. We thus have the following corollary (it appears challenging to obtain sharp 
constants [24l [27] , so we leave many unspecified) . 

Corollary 7. Let the stochastic process P he the nonlinear ARMA process 

it+i = Ait + h{it) + ^Wt, 

where the singular value pi{A) < 1, /i(^) = o(||^||) as ||^|| — oo, and K[\\Wt\\2] ^ Assump- 
tion\Bi hold and 6 G (0, 1). Then there exist constants M > 1, p £ {0, 1), and a universal constant 
C < 4 such that with probability at least 1 — 6 

f c>2 log- Ml. r'2 
/(^(T))-/(.^)<infC.(-^ + ^.^ + .Gi^ 

^GR / log^log(ilog^/(l-p)) \ 

Having provided Corollaries [6] and [71 we can now somewhat more concretely contrast our results 
with those of Ram et al. [31]. Ram et al.'s results (essentially) apply when the set H is finite, as they 
define their objective /(x) = h[^) functions /j; the ARMA example does not satisfy this 

property. In addition. Ram et al. assume in the MIGD case that the network of agents {!,... 
is strongly connected over time: for any t, if one defines E{t) = {{i,j) : P{t)ij > 0}, there exists 
a finite t' £ N such that U*=j-E(s) defines a strongly connected graph. This assumption need not 
hold for our analysis and fails for the examples motivating Corollary [6| 



4.4 Slowly mixing processes 

Many ergodic processes do not enjoy the fast convergence rates of the previous three examples. 
Thus we turn to a brief discussion of more slowly mixing processes, which culminates in a re- 
sult (Corollary [9]) establishing asymptotic convergence of EMD for any ergodic process satisfying 
Assumption O 

Our starting point is an example of a continuous state space Markov chain that exhibits a 
mixing rate of the form (w.l.o.g. let Af > 1 and /3 > 0) 

T^^{P,e)<Me-^. (24) 

Consider a Metropolis-Hastings sampler ^6] with the stationary distribution H, assumed (for sim- 
plicity) to have a density vr. The Metropolis-Hastings sampler uses a Markov chain Q as a "pro- 
posal" distribution, where Qi^t, •) denotes the distribution of 6.t+i conditioned on ^t, and Q{(,t, •) 
is assumed to have density q{i,t,')- The Markov chain constructed from Q and H transitions from 
a point ^1 to ^2 as follows: first, the procedure samples ^ according to Q{S,i, •)', second, the sample 
is accepted and .^2 is set to with probability min{ , 1}, otherwise ^2 = ^1- Metropolis- 

Hastings algorithms are the backbone for a large family of MCMC sampling procedures [2S| • When 
Q generates independent samples — that is, g(^, •) = q{-) for all ^ — then the associated Markov chain 
is uniformly ergodic only when the ratio q{(^) /it{(,) is bounded away from zero over the sample space 
E |271 Chapter 20]. 
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When such a lower bound fails to exist, Metropolis-Hastings has slower mixing times. Jarner 
and Roberts [18] give an example where 11 is uniform on [0, 1] and the density q{x) = (r + l)x^ for 
some r > 0; they show in this case a polynomial mixing rate (j24p with /3 = 1/r; other examples of 
similar rates include particular random walks on [0, oo) or queuing processes in continuous time. 

We now state a corollary of our main results when the mixing time takes the form (j24p . 



Corollary 8. Let x(t) evolve according to the EMD update where the sampling distribution P 
is a Markov chain with r^y(P, e) < Me~^. Assume that T > [R/G)"^. Under Assumption\M and 
withait) s gr-<«+l)/(/S+2), 



E[/(S(T)l)-/(.*)<^°™"" 



The stepsize choice a{t) = R/{G\/t) gives that 

Z\' 1 2^2+2/3 

Proof By applying the bound in Corollary [H we see that the expected convergence rate for the 
fixed setting of a{t) = a in the statement of the corollary is 

+9la + 2eGR + IMe'^G^a + ^'"^^^ <J^ + ^a + 2eGR + 3Me-^G^a, 



2Ta 2 T - 2Ta 2 



usmg the assumption that T>{R/Gf. We can choose e arbitrarily, so set e = {3l3GMa/ RY'^'^'^"' . 
Using the proposed stepsize a{t) = {R/G)T^^^^^''^^^^'^\ we find that the above is equal to 

i?2 G2 , 1 ^2M„^ GR GR 4GR(3M)^ 
H a+{2 + /3 i+/3)ai+/3(3M)i+/3Gi+/3i?i+/3 < — —+ ^ ' 



2Ta ^ ■ 2T2+/3 2T^ T^+iS 

where we use 2 + /3~''/(i+/^) < 4. Noting that /3 > yields the first statement. 

With the step size choice a{t) = a/Vt with multiplier a = R/G, we can apply Theorem [H 
along with the bound (jlOp in the proof of Corollary [2l to see that 

nmm - < ^ + 2eGR + ^^tvC^^)^^ + t^a^^gr ^^^^ 



Noting that 1/T + 2/VT < 3/VT, we turn to bounding 



2eGR+'-^^^m^<2eGR + e-^'-^. 



(26) 

T VT 



Since e does not enter into the algorithm at all, we are free to minimize over e, and taking derivatives 
we see that we must solve 
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Since (3'^/(P+'^') < e/2 and < e/2, this choice of e in the bound ([26]) yields 

inf 1 2eGR + ^^TvC^e)^-^ | < eGR{3M/2) WTi . Twh . 



By inspection, this inequahty and the convergence guarantee (|25]) give the second statement of the 
corollary. □ 



A weakness of the above bound is that the sharper rate of convergence requires knowledge 
of the mixing rate of P, and choosing the polynomial incorrectly can lead to significantly slower 
convergence. In contrast, as noted in § [3l our other bounds are robust to mis-specification of 
the step size so long as the ergodic process P mixes suitably quickly and we can choose a{t) oc 
Nonetheless, Corollary [8] gives a finite sample convergence rate whose dependence on the 
slower mixing of the ergodic process is clear. In addition, the proof of Corollary [8] exhibits a 
simple technique we can use to demonstrate that the stepsize choice a{t) = a/\/t provably yields 
convergence, both in expectation and with high probability. To be specific, note that the bound in 
Corollary [1] guarantees that for x{T) = ^ Ylt=i ^(*)) if choose a{t) = a/\/t then 

))] - f[x )< + + 3eGR + + , (27) 

where Tmix denotes either the Hellinger or total variation mixing time. The convergence guaran- 
tee (j27p holds regardless of our choice of e, so we can choose e minimizing the right-hand side. That 
is (setting a = R/G for notational convenience), 

- /(x^) < ^ + mf [seGR + ^^EE^)^ + _ 

For any fixed e > 0, the term inside the infimum decreases to AeGR as T f cx), so the infimal term 
decreases to zero as T f oo. High probability convergence follows similarly by using Theorem [21 
since for any 5^ > we have 



2VT e>o{ Jf T 



+ ^f„(P..).og^.^} (28) 



with probability at least 1 — 6t- We obtain the following corollary: 

Corollary 9. Define x{T) = ^ Ylt=i x{t). Under the conditions of Theorem\^ the stepsize sequence 
a{t) = al\ft for any q > yields f{x{T)) — >• f{x*) as T ^ oo both in expectation and with 
probability 1. 

Proof Fix 7 > and let Et denote the event that f{x{T)) — f{x*) > 7. We use the Borel-Cantelh 
lemma [6j to argue that Et occurs for only a finite number of T with probability one. Take the 
sequence 6t = l/T^ (any sequence for which log{l/ST)/T | as T — )• 00 and Y1't=i < 00 will 
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suffice) and choose some Tq such that the right-hand side of the bound (j28p is less than 7. Then 
we have 

00 00 00 00 

Y,P{f{x{T))-f{x*)>j) = Y,P{ET)<To+ P{Et) < To + Y,^T< 00. 

T=l T=l T=To+l T=l 

For any 7 > 0, we have P{f{x{T)) - f{x*) > 7 i.o.) = 0. □ 



5 Numerical results 

In this section, we present simulation experiments that further investigate the behavior of the 
EMD algorithm ([5]). Though Theorem [J] guarantees that our rates are essentially unimprovable, 
it is interesting to compare our method with other natural well-known procedures. We would also 
like to understand the benefits of the mirror descent approach for problems in which the natural 
geometry is non-Euclidean as well as the robustness properties of the algorithm. 

5.1 Sampling strategies 

For our first experiment, we study the performance of the EMD algorithm on a robust system 
identification task [33], where we assume the data is generated by an autoregressive process. More 
precisely, our data generation mechanism is as follows. For each experiment, we set the matrix A to 
be a sub-diagonal matrix (all entries are except those on the sub-diagonal), where is drawn 

uniformly from [.8, .99]. We then draw a vector u uniformly from surface of the d-dimensional 
^2-ball of radius R = 5. The data comes in pairs {Cti^t) £ IR"^ x 1^ with d = 50 and is generated as 
follows: 

= A^l, + eiWt, ft = {u,et) + Et, (29) 

where ei is the first standard basis vector, Wt are i.i.d. samples from A^(0, 1), and Et are i.i.d. 
bi- exponential random variables with variance 1. Polyak and Tsypkin [33] suggest the method of 
least-moduli for the system identification task, setting 

F{x;{i\e)) = \{x,e)-e\, 

which is optimal (in a minimax sense) when little is known about the noise distribution [33j. Our 
minimization problem is 

minimize f{x) = En [\{x,(^) - subject to ||x||2 < R, (30) 

where 11 is the stationary distribution of the AR model ()29p and we take R = 5. 

We use this experiment to investigate two issues. In addition to studying the performance 
of the EMD algorithm in minimizing the expected objective (I30p . we compare EMD to a natural 
alternative. In many engineering applications it is possible to generate samples from a distribution P 
that converges to 11, in which case a natural algorithm is to use the so-called "multiple replications" 
approach (e.g., [E]). In this approach, one specifies initial conditions of the stochastic process P, 
then simulates it for some number k of steps, and obtains a sample ^ according to P^, which 
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(hopefully) is close to 11. Repeating this, one can obtain multiple independent samples ^ from , 
then use standard algorithms and analyses for independent dataH A difficulty with this approach — 
which we see in our experiments — is that the mixing time of the process P may be unknown, and 
if P^ does not converge precisely to 11 for any finite A; G N, then any algorithm using such samples 
will be biased even in the limit of infinite gradient steps. 

As a natural representative from the multiple-replications family of algorithms, we use the 
classical stochastic gradient descent (SGD) algorithm (in the form studied by Nemirovski et al. |30j). 
To generate each sample for SGD, we begin with the point ~ ^ ^^"-^ perform k of steps of the 
procedure (f29]) . using to compute subgradients for SGD. For EMD, we use the proximal 

function ij)[x) = \ \\x\\2, which yields the direct analogue of stochastic gradient descent. To measure 
the objective value /(x), we generate an independent fixed sample of size N = 10^ from the 
process (j29|) . using f{x) w -^^^=iF{x',^i)- For each algorithm, to choose the stepsize multiplier 
a oc R/G, we estimate G by taking 100 samples and computing the empirical average of Hi- 
For EMD, we deliberately underestimate the mixing time by the constant 1 (other estimates of the 
mixing time yielded similar performance). 

In Figure [H we show the convergence behavior (as a function of number of samples) for the EMD 
algorithm compared with the behavior of the stochastic gradient method for different numbers k of 
initial simulation steps before obtaining the sample ^ used in each iteration of SGD. The line in each 
plot corresponding to SGD-A; shows the convergence of stochastic gradient descent as a function 
of number of iterations when k initial samples are used for each independent sample ^. The left 
plot in Figure [1] makes clear that if the mixing time is underestimated, the multiple-replications 
approach fails. As demonstrated by our theory, however, EMD still guarantees convergence even 
with poor stepsize choices (see also our experiments in the next section). For large enough mixing 
time estimate k, the multiple-replication stochastic gradient method and the EMD method have 
comparable performance in terms of optimization error as a function of number of gradient steps. 

^This approach is inapplicable when the data comes from a real (unsimulated) source, such as in streaming, 
online optimization, or statistical applications, though the EMD algorithm still applies. 
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Figure 2: Left: optimization error on a statistical machine learning task of the Euclidean variant 
of the EMD algorithm ([5]) versus that of the £p-norm variant with ip{x) = ^ \\x\\p, p = 1 + 1/logd, 
plotted against number of iterations. Right: robustness of the EMD algorithm ([5]) to modifications 
in the choice of stepsize. 



The right plot in Figure [T] shows the convergence behavior of the competing methods as a function 
of the number of samples of the stochastic process (I29p . From this plot, it becomes clear that using 
each sample sequentially as in EMD — rather than attempting to draw independent samples at each 
iteration — is the more computationally efficient approach. 

5.2 Robustness and non-Euclidean geometry 

In our second numerical experiment, we study an important problem that takes motivation from 
distributed statistical machine learning problems: the support vector machine problem [TT], where 
the samples ^ G M'^ and the instantaneous objective is 

F(x;0 = [l-(e,x)]+. 

We study the performance of the EMD algorithm for the distributed Markov incremental mirror 
descent framework in § 14.11 In the notation of § 14.11 we simulate n = 50 "processors," and for each 
we draw a sample of m = 50 samples according to the following process. Before performing any 
sampling, we set w to be a random vector from {x £ M.'^ : \\x\\-^^ < R}, where i? = 5 and d = 500. 
To generate the ith data sample, we draw a vector Oj S with entries aij S {—1, 1} each with 
probability ^, and set bi = sign((ai, u)). With probability .05, we flip the sign of 6j (this makes 
the problem slightly more difficult, as no vector x will perfectly satisfy bi = sign((aj, x))), and 
regardless we set = biUi. We thus generate a total of = nm = 2500 samples, and set the ith 
objective in the distributed minimization problem (|17p to be 

-. mi 

/*(^) = - E i^(^;^fc)=EnjF(^;0]=IEnJ[l-(C,x)]J, (31) 

fc=m(i-l)+l 
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where Hi denotes the uniform distribution over the ith block of m samples. Our algorithm to 
minimize f{x) = ^ fi{x) is the Markov analogue (llSp of the general EMD algorithm ([5]). We 
minimize f{x) over {x : \\x\\-^ < R} offline using standard LP software to obtain the optimal value 
f{x*) of the problem. 

We use the objectives (j3ip to (i) understand the effectiveness of allowing non-Euclidean proximal 
functions ip in the update ([5]) and (ii) study the robustness of the EMD algorithm ([5]) to stepsize 
selection. We begin with the first goal. As noted by Ben-Tal et al. [3], the choice ip(x) = ^ \\x\\p 
with p = 1 + l/log((i) yields a nearly optimal dependence on dimension in non-Euclidean gradient 
methods. Let Tmix denote the mixing time of the Markov chain (for Hellinger or total variation 
distance). Applying Corollary [3] and the analysis of Ben-Tal et al. with this choice of proximal 
function and a = R/ \Aog(dyT\^ yields 

Eimm - inf /(x) = o ^^^/^a 

since ||(9^F(x; < \\^\\^ = 1 by our sampling of the vectors Oj G {—1, l}*^, and R is the radius of 
X in ^i-norm. Compared to the Euclidean variant [20\ I34j with ^(x) = ^ ||2;||2, whose convergence 
rate also follows from Corollary [3l this is an improvement of y/ dj log d, since ^)||2 can be 

as large as \fd. 

We plot the results of 50 simulations of the distributed minimization problem in the left plot of 
Figure [2j For our underlying network topology, we use a 4-connected cycle (each node in the cycle 
is connected to its 4 neighbors on the right and left) and n = 50 nodes. The line of blue squares is 
the mirror-descent approach with '\\){x) = ^ with p = 1 + 1/ log(d) (we use d = 500), while the 

black line of circles denotes the Euclidean variant with '4'{x) = ^ ||2;||2. The dotted lines below and 
above each plot give the 5th and 95th percentiles, respectively, of the optimization error across all 
simulations. For each algorithm, we use the optimal step size setting a{t) predicted by our theory 
(recall Corollary [3]) . It is clear that the non-Euclidean variant enjoys better performance, as our 
theory (and previous work on the dimension dependence of mirror descent [311 |30l U |3]) suggests. 

The final simulation we perform is on the same problem, but we investigate the robustness of 
the EMD algorithm to mis-specified stepsizes. We take the stepsize a* predicted by our theory 
(Corollary [3]), and use a{t) = ja* /\/t for values of 7 uniformly logarithmically spaced from 7 = 10~^ 
to 7 = 10^. The plot on the right side of Figure [2] shows the mean optimality gap of x(T) after 
T = 10000 iterations for different values of 7, along with standard deviations, across 50 experiments. 
The black dotted line shows the predicted optimality gap as a function of the mis-specification (recall 
our discussion on robustness following Corollary [2]) . The EMD algorithm is certainly affected by 
mis-specification of the initial stepsize, though for a range of values of roug hly 7 = 10"^ to 7 = 10, 
the performance degradation does not appear extraordinary. In addition, our experiments show 
that our theoretical predictions appear to capture the empirical behavior of the method quite well. 



6 Analysis 

In this section, we analyze the convergence of the EMD algorithm from Section [2j Our first 
subsection lays the groundwork, gives necessary notation, and provides a few optimization-based 
results. The second subsection contains the proofs of results on expected rates of convergence, 
while the third subsection shows how to achieve convergence guarantees with high probability. The 
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fourth subsection shows the convergence of the EMD method under probabihstic (random) mixing 
times, while the final subsection proves the order-optimality of the EMD method. 



6.1 Definitions, assumptions, and optimization-based results 

To state our results formally, we begin by giving a few standard definitions and collecting a few 
consequences of Assumptions [Al and IB] that make our proofs cleaner. Recall the measurable selection 
G, where G(x; ^) G dxF{x;^) represents a fixed and measurable element of the subgradient of F{-;^) 
evaluated at x, and the EMD algorithm ([5]) has g{t) = G(x(t);^t). By our assumptions on F, for 
any distribution Q for which the expectations below are defined, expectation and subdifferentiation 
commute [371 E] : 

/q(x) ■.= Eq[F{x;C)] = jj{x-i)dQ{i) then a/Q(x) = Eq[9F(x; C)]. 

In particular, En[9i^(x;^)] = df{x) and En[G(x;^)] G df{x). In addition, the compactness as- 
sumption that D^{x*,x{t)) < ^i?^ for all t coupled with the strong convexity of -0 implies 

\\x{t) - x*f <2D^{x*,x{t)) < so \\x{t) - x*\\ < R. (32) 

We now provide two relatively standard optimization-theoretic results that make our proofs 
substantially easier. To make the presentation self-contained, we give proofs of these results in 
Appendix |X1 The two lemmas are essentially present in earlier work |3H [3] , but our stochastic 
setting requires a bit of care. 

Lemma 10. Let x{t) be defined by the EMD update For any r G N and any x* G X , 
^ F(x(t);6)-F(x^6)<^i?'+ E ^Uml- 

t=T+l ^ ' t=T+l 

Lemma 11. Let x{t) be generated according to the EMD algorithm Then 

\\x{t)-x{t + l)\\<a{t)\\g{t)\l. 



6.2 Expected convergence rates 

Now that we have established the relevant optimization-based results and setup in Section 16. H the 
proof of Theorem [1] requires that we understand the impact of the ergodic sequence Ci)^2)--- on 
the EMD procedure. The key equality that allows us to prove Theorems [1] and [2] is the following: 
for any r > 0, 

T T-T 



t=l t=l 

T 



f{x{t)) - fix'') = fi^it)) - /(^') - ^(^(i); + F{X*; 6+r) 

t=l 

T-T 

+ Y F{x{ty,Ct+r) - F{x{t + r); 6+r) (33) 
t=i 

T T 

+ Y n<tUt)-F{x^-it)+ f{<t))-f{x^). 



t=T + l t=T-T+l 
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We may set r = in the expression (j33|) , taking expectations and applying Lemma [TOl to 
recover the known convergence rates [30] for the stochastic gradient method with independent 
samples. However, the essential idea that the expansion (j33p allows us to implement is that for 
large enough r, the sample ^t+r is nearly "independent" of the parameters x{t), since the stochastic 
process P is mixing. By allowing r > 0, we can bound the four sums (j33p using a combination 
of Lemmas [10] and [TTl then apply the mixing properties of the stochastic process P to show that 
F{x{t); ^t+r) is a nearly unbiased estimate of f{x{t)): 

E[f{x{t))-F{x{ty,^t+r)]^0. 

We formalize this intuition with two lemmas, whose proofs we provide in Appendix [Bl 

Lemma 12. Let x be J^t-i^^sasurable and r > 1. If Assumption \A\ holds. 

|E [fix) - f{x^) - F{x; 6+0 + F{x^; | < 3GR ■ 4ei (^{f . 

// Assumption \B\ holds, 

|E [fix) - fix*) - Fix; 6+0 + Fix*; 6+0 \Tt]\<GR- d^^ (i^ff", n) . 

The next lemma applies a type of stability argument, showing that function values between a;(t) 
and xit + t) cannot be too far apart. 

Lemma 13. Let r > and ait) be non-increasing. If Assumption\Mholds, then 

E[F(x(t); 6+r) - Fixit + r); 6+r) | J't-i] < Tait)G\ 
If Assumption O holds, then 

6+r) - Fixit + r); 6+r) < Tait)G\ 

We can now apply Lemmas [T0Hl3] to give the promised proof of Theorem [TJ 
Proof of Theorem [T] The equality (I33p is non-probabilistic, so all we need to complete the proof 
is to take expectations, applying the preceding lemmas. First, we map r to r — 1 in the previous 
results, which will make our analysis cleaner. Throughout this proof, the quantity H) will denote 
3c?hei('5n) when we apply Assumption 1X1 and will denote (•,]!) when using Assumption [B| as 
the proof is identical in either case. We control the expectation of each of the four sums (j33p in 
turn. First, we apply Lemma [12] to see that 

T-r+l T-T+1 

E[/(x(t)) - fix*) - F(x(t);6+r-i) +i^(x^6+r-l)] <GR IE[d(P*+\ H)]. 

t=l t=Q 

The second of the four sums (j33p requires Lemma [TS] which yields 

T-T+l T-r+1 

Y E[F(x(t);6+r-i)-i^(x(t + T-l);6+r-i)] < (t-1)G2 Y 
t=l t=l 
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Lemma [TO] controls the third term in the series (j33p . and taking expectations gives E[||g(t)||^] < G^. 
The final term in the sum (1330 is bounded by (r — 1)RG when either of the Lipschitz assumptions lAl 
or [B] hold. Summing our four bounds, we obtain that for any t > 1, 



E 



T 



T-T+1 



<GR E 



t=i 



t=i 



T-T + 1 



r>2 /-»2 ^ 

it Lt V — ^ . ^ 



2a{T) 



t=T 



(34) 



t=i 



Assumption ICl states that there exists a uniform mixing time rmix(-P) e) (for both total variation 
and Hellinger mixing) such that (i(Pj^^y~^, IT) < e. Applying the definition of Tmix for Hellinger or 
total variation mixing completes the proof. □ 



6.3 High-probability convergence 

In this section, we complement the convergence bounds in Section 16.21 with high-probability state- 
ments. We use martingale theory to show that the bound of Theorem [1] holds with high probability. 
We begin from the same starting point as the proof of Theorem [T] — with the expansion (|33p — but 
now we show that the random sum 

T-T+1 

Y f{x{t))-f{x*)-F{x{ty,Ct+T-i)+F{x*;^t+T-i) (35) 
t=i 

is small with high probability. Intuitively, this follows because given the initial t — t samples 
^1, . . . ,^t-r, the tth sample is almost a sample from the stationary distribution 11. With this 
in mind, we can show that an appropriately subsampled version of the above sequence behaves 
approximately as a martingale, and we can then apply Azuma's inequality [2] to derive high- 
probability guarantees on the sum (|35p . 

Proposition 1. Let Assumption\Bi hold and 6 G (0, 1). With probability at least 1 — 5, for r G N 
with T G [l,r/2], 

T-T + 1 

Y ifi^it)) - F(x(t); 6+r-i) + F{x^;^t+T-i) - f{x'')] 
t=i 




We provide a proof in Appendix [C] and can now prove Theorem [2j 
Proof of Theorem [2] The proof is a combination of the proofs of previous results. Starting 
from the expansion psp . we use Lemma [13] to see that 

T-T+1 T-T+1 

Y F{x{t);^t+T-i)-F{x{t + T-l);^t+T-i)<{T-l)G' Y 
t=i t=i 
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and applying the G-Lipschitz continuity of the functions £iiid compactness of X we obtain 

^ f{x{t)) - fix*) < (r - 1)GR. 

t=T-T+2 

In addition, the convergence guarantee in Lemma [10] guarantees that 

rp rj-i 
t=T ^ ' t=l 



(36) 



Combining these bounds, we can replace the equality (|33|) with the bound 



t=i 



t=i 



t=i 



T-T+l 



[f{x{t)) - 6+r-i) + 6+r-i) - fix*)] 



t=l 



which holds for any r > 1. What remains is to replace the last term in the non-probabilistic 
bound (136p with the upper bound in Proposition [H which holds with probability 1 — S, and then 
to replace r with T^y(P, e), which guarantees the inequality (i^y(Pn_ ,,11) < e. 



□ 



6.4 Random mixing 

In this section, we give the proof of Theorem [3j The proof is similar to that of Theorem [2l but we 
need an auxiliary lemma that allows us to guarantee that the mixing times are bounded uniformly 
for all times and for all desired accuracies of mixing e. See Appendix [D] for the proof of the lemma. 

Lemma 14. Let Assumption\D\ hold and 5 € (0, 1). With probability at least 1 — 6, 
max sup (r^v(P[,],e) -r^v(P,e)) < k ( log ^ + 2 log(r) ) . 

Rewriting Lemma [U] slightly, we may define r = r^y(P, e) + ^(log | + 21og(T)), and we find 
that with probability at least 1 — 5, 

d,.{P[:l^,n)<e (37) 
for all s G {1, . . . , T} and for all e > with r^^ (P, e) < T. This leads us to 

Proof of Theorem [3] All that is different in the proof of this theorem from that of Theorem [5] 
is that in the penultimate inequality (j36p . when we apply Proposition [H we no longer have the 
guarantee that (Pj*_^j , 11) < e for all t. To that end, let e be such that r^y(P, e) < T. Apply 
Lemma [ni and its consequence (f37|) . which states that if we take r = t^^{P, e) + ^(log ^ + 2 log(r)), 
then we obtain that d^^ (^[i-r] > n) < e with probability at least 1 — 5. If r^y (P, e) > T, the bound 
in the theorem holds vacuously, so we may extend the result to all e > 0. □ 
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6.5 Lower bounds on optimization accuracy 

Our proof of Theorem [J] mirrors the proof of Theorem 1 in the paper by Agarwal et al. [1], so we 
are somewhat terse in our description and proof. The intuition in the proof is that if the stochastic 
process P returns a sample from the stationary distribution 11 every r timesteps, otherwise returning 
a sample identical to the previous one, then the convergence rate of any algorithm should be a factor 
of r slower than if it could receive independent samples from 11. Mesterharm |26) employs a similar 
approach to give a lower bound on the performance of online learning algorithms. More formally, 
by using an identical construction to [H Section IV. A], we may reduce the problem of minimization 
of a function / : R'^ — t- M to identifying the bias of d coins. To that end, let V C { — 1, l}'^ be a 
packing of the d-dimensional hypercube such that v,!^' £ V with v ^ v' satisfy — z^'H^^ > (i/2; it 
is a classical fact [25] that there is such a set with cardinality |V| > (26^2)^^/2. 

Now for a fixed r G N, consider the following sequential sampling procedure, which generates 
a set of pairs of random vectors {(C/^, 1^)}^^. Choose a vector € V uniformly at random and let 
(5 E (0,1/4]. Let Py denote the distribution (conditional on v) that corresponds to the following: 
for each t, construct samples according to 

(a) If (t - 1) mod r / 0, take Ut = Ut-\ and Yt = Yt-x. 

(b) Otherwise, pick a uniformly random subset Ut C {1, . . . ,d} of size \\Jt\ = m, then 

(i) For each i £ Ut, construct a random variable Ci such that Ci = 1 with probability ^ + uid 
and Ci = —1 with probability | — 

(ii) Construct the vector Yf G {—1,1}'^ such that Yt^i = Cj if i G Ut, and otherwise Yt^i is 
uniform Bernoulli, that is, if i ^ Ut then Yt^i = 1 with probability ^ and Yt^i = — 1 with 
probability ^. 

This sampling procedure yields a sequence = {Ut,Yt), where if IIi^ is the distribution of a pair 
(U, Y) such that U C {1, . . . , d} is chosen uniformly at random with size \U\ = m and Y is sampled 
according to the steps (i)-(ii) above, then Uy is the stationary distribution of P^. Moreover, we see 
that (Pj*^^"'''^ , n) = for any k > and any t, since the distribution Py corresponds to receiving 
an independent sample (U, Y^ from IIj^ every r steps. 

Let I{X;Y) denote the mutual information between random variables X and Y and let H[X) 
denote the (Shannon) entropy of X. By inspection of Agarwal et al.'s proof [1., Lemma 3], 
since 11,^ is the stationary distribution oi Py, a tight enough bound on the mutual information 
/((C/i, Yi), . . . , {Ut,Yj'); v) proves Theorem [4l Hence, we provide the following lemma: 

Lemma 15. Let the sequence ^t = {Ut,Yt) be generated according to the steps ((a|-(jb]) above. Then 
for Se {0,1/ A], 



I{{Ui,Yx),...,{Ut,Yt);u) < 16 



Proof Our sampling model (|aj)-(jb]) sets blocks of size r to be equal, that is, {Ui,Yi) = ■ ■ ■ = 
{Ut, Yr), {Ur+i, Yr+i) = ■ ■ ■ = (C/2r, ^2t), and SO on, whereas different blocks are independent given 
the variable v. We thus see that by the definitions of mutual information, conditional entropy, and 
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that entropy is sub-additive [12], 
I{{Ui,Y^),...,{Ut,Yt);u) 

= H{{Ui,Yi), . . . , {Ut, Yt)) - H{{Ui,Yi), . . . , {Ut, It) | i^) 

= Hi{UuY,),...,iUT,YT))- ^ F((;7(fc_i),+i,y(fc_i),+i),...,([7fc,,nO|z.) 

k=l 

< [H y(fc_i)^+i), . . . , iUkT,YkT)) - H ((C/(fc_i)^+i, Y(fc_i)^+i), . . . , {UkT,YkT) I J^)] 

k=l 

\T/r] \T/t] 

= I iiU(^k-l)r+l,Y(^k-l)r+l), ■ ■ ■ ,iUkr,Ykry,J^) = / ((C/fc,, Ffc,); z.) . (38) 

k=l k=l 

In the last hne we have used that within the same block of size r, all (Ut, Yt) pairs are equal. Now, 
using the bound (j38p . we apply an identical derivation as that given in the proof of Agarwal et 
al.'s Lemma 3 (following Eq. (25) there). For any fixed k we have I{{UkT,YkT)', i') < IGmJ^, which 
completes the proof of the lemma. □ 



Proof of Theorem[4] Use Agarwal et al.'s construction (see Eq. (16) in Section IV. A of [T]) of a 
"difficult" subclass of functions, then in the proof of Theorem 1 from [T], replace their coin-flipping 
oracle with the steps (jaj)-® and applications of their Lemma 3 with Lemma [15] above. □ 



7 Conclusions 

In this paper, we have shown that stochastic subgradient and mirror descent approaches extend 
in an elegant way to situations in which we have no access to i.i.d. samples from the desired 
distribution. In spite of this difficulty, we are able to achieve reasonably fast rates of convergence 
for the ergodic mirror descent algorithm — the natural extension of stochastic mirror descent — under 
reasonable assumptions on the ergodicity of the stochastic process {£,t} that generates the samples. 
We gave several examples showing the strengths and uses of our new analysis, and believe that 
there are many more. In addition, our results give a relatively clean and simple way to derive 
finite sample rates of convergence for statistical estimators with dependent data without requiring 
the full machinery of empirical process theory (e.g., [42J). Though we have provided lower bounds 
showing that our analysis is tight to numerical constants, it may be possible to sharpen our results 
for interesting special cases, such as when the distribution of the stochastic process {^t} has nice 
enough Markovianity properties. We leave such questions to future work. 
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A Proofs of Optimization Results 

Proof of Lemma 1101 The proof of the lemma begins by controUing the amount of progress 
made by one step of the EMD method, then summing the resulting bound. By the first-order 
convexity inequality and definition of the subgradient g{t), we have 

F{x{t);Ct)-F{x*;Ct) < {g{t),x{t) - x*) 

= {g{t),x{t + 1) - X*) + {g{t),x(t + 1) - x{t)) . (39) 

For y ^ X, the first-order optimality conditions for x{t + 1) in the update ([5]) imply 

{a{t)g{t) + Vi;{x{t + 1)) - V7p{x{t)),y - x{t + 1)) > 0. 

In particular, we can take y = x* in this bound to find 

a{t) {g{t),x{t + 1) - X*) < {V4^{x{t + 1)) - Vi^{x{t)),x* - x{t + 1)) . (40) 

Now we use the definition of the Bregman divergence D^, to obtain 

{Vip{x{t + 1)) - Vip{x{t)),x* - x{t + 1)) 

= D^{x*,xit)) - D^{x*,x{t + 1)) - D^{x{t + l),x{t)). 



Combining this result with the expanded gradient term (|39p and the the first-order convexity 
inequality (j40]l . we get 



F{x{t)-it) - F{x* ; 6) < -^D^{x*,xit)) - -^D^{x* , x{t + 1)) 

a[t) a{t) 

^ D^{x{t + l),x{t)) + {g{t), x{t + 1) - x{t)) 



a{t) 

< -l^D4x*,xit)) - -l^D^ix*,xit + 1)) - -^D^xit + l),xit)) 
a{t) a[t) a{t) 

"(0 II u\u2 , 1 11^^^ , -,^ ^u\u2 



< -^^D^{x*,x{t)) - -^^D^{x\x{t + 1)) + 4^ hml ■ 



1 II l|2 



The inequality (i) is a consequence of the Fenchel- Young inequality applied to the conjugates 
and ^ ||-||^ (see, e.g., [U Example 3.27]), while the inequality (ii) follows by the strong convexity of 
Tp, which gives D^{x{t + l),x(t)) > | \\x{t + 1) - x(t)f . 
Summing the final inequality, we obtain 

T 

[F{x{ty,^t)-F{x*-^t)] 

t=T+l 

^ E -^AD^{x*,x{t))-D^{x*,x{t +!))]+ ^ii^r'" 

t=r+l"^^^ t=T+l ^ 



28 



Using the compactness assumption that D^{x*,x) < -^R for all x G A', we have 



^[D^{x\x{t))-D^{x\x{t + l))] 



t=T + l 



T 

< D4x*,x{t)) 

t=T + 2 

?2 T 



a{t) a{t - 1) 



+ 



a(r + 1 



-D^(x*,x(r + 1)) 



- 2 ^ 

t=T + 2 



1 



1 



a{t) a{t - 1) 



+ 



1 



2a{T + i; 



i?2 



2a(T) ' 



where for the last inequality we used that the stepsizes a{t) are non-increasing. 



□ 



Proof of Lemma llll By the first-order condition for the optimality of x{t+l) for the update ([5]), 
we have 

{a{t)g{t) + Vi;{x{t + 1)) - Vij{x{t)),x{t) - x{t + 1)) > 0. 

Rewriting, we have 

{V^{x{t)) - V7p{x{t + l)),x{t) - x{t + 1)) < a{t) {g{t),x{t) - x{t + 1)) 

<a{t)\\g{t)\l\\x{t)-x{t + l)\\ 

using Holder's inequality. Simple algebra shows that 

D^{x{t),x{t + 1)) + D^{x{t + l),x{t)) = (Vijixit)) - V^{x{t + l)),x{t) - x{t + 1)) , 

and by the assumed strong convexity of V'j we see 

\\x{t) - x{t + l)f < D^{x{t + l),x{t)) + D^{x{t), x{t + 1)) 
<a{t)\\g{t)\l\\xit)-xit + l)\\. 

Dividing by — x(t + 1)|| gives the desired result. □ 



B Mixing and expected function values 

Proof of Lemma 1121 Since x G J^f, we may integrate only against ^ when taking expectations, 
which yields 

E [fix] - fix*) - F{x; + F{x*; ^t+r) \ J't] 

= J {F{x;0-F{x'';0)dm)- J{nx;0 " F(x^ 0)rf^'5^(0• 
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Since we assume P^^j and IT have densities pj^j and vr with respect to a measure /i, this difference 
becomes / {F{x;^) — F{x*;S^)){7r{£^) — p^^'^ {£^))dfi{^) . Setting p = p^^^'^ for shorthand, we obtain 



{Fix;C)-F{x*;C))i7TiO-pmdKO 



< / \F{x;C) - Fix*;C)\\piC) - 7riC)\d^{C) 



J \Fix; - F{x*; 0\ {VW) + VRO) \ " VRO] d^O 




{F{x;C)-F{x*;0)HV<0 + 



'd/i(04el(^'[if",n) 



by Holder's inequahty. Applying the inequality (a + 5)^ < 2a^ + 26^, valid for a, 6 G M, we obtain 
the further bound 



V2(Eu[{F{x;0 - F{x*;Of]+K[{F{x;Ct+r) - F{x*;Ct+r)f I -F^]) ^ dhci(P{f H). 



(41) 



To control the expectation terms in the bound (I4ip . we now use AssumptionlAl By the (P-almost 
sure) convexity of the function x i— )• -F(x;^), we observe that 

F{x;^)-F{x*;C)<{<^{x;0,x-x*) and F{x*;C) - F{x;C) < {<^{x*;0,x* - x) . 
Combining these two inequalities, we see that 

{F{x; - F{x*; O f < max { (G(x; C), x - x*f , (G(x*; 0,^* - x)^} 
<max{||G(x;C)L',||G(x^OII*} 
<max{||G(x;OIL',||G(x^Oll'}^' 



\x — X 
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„*||2 



where the last inequality uses our compactness assumption (jl]) . Now we invoke Assumption |X] 
combined with the above inequality to obtain the further bound 



E 



{F{x;(t+r)-F{x'';^t+r)f\Tt 



G(x;6+r)||' + ||G(x^6+r)||'|-F^ 



An analogous argument yields the same bound for the expectation under the stationary distribution, 
so based on our earlier bound dlTl) we have 
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This completes the proof of the first statement of the lemma. 

The second statement is simpler: apply Assumption iBl to obtain 

{F{x-o - F{x*;m<o - piO)MO <grJ \p{o-<omo- 

Observing that the above bound is equal to GRdrj,y (^P^^^ completes the proof. 



□ 



Proof of Lemma 1131 For any x measurable with respect to the cr-field J-g, we can define the 
function /i[s](x) = K[F{x;S,s+i) \ J^s]- Assumption lAl implies that h^gj is a G-Lipschitz continuous 
function so long as its argument is Js-measurable, that is, — < G||a; — y|| for 

x,y £ J-g. In turn, this implies that 

E[F{x{t);Ct+r)-F{x{t + T);Ct+r) \ -^i-i] 

t+T-l 

= E[F{x{sy,^t+r) - F{X{S + iy,^t+r) \ J't-l] 

s=t 

t+T-1 

= ^ E [E[F(x(s); 6+r) - F{x{s + 1); 6+r) | J't+r-l] I J't-i] 
s=t 

i+r-1 

< E[G\\x{s)-x{s + l)\\ \Tt-i] 

s=t 

since x(s) is J-t+r-i -measurable for s < t + t. Now we apply Lemma [TT\ which shows that 
\\x{s) — x{s + 1)11 < a(s) ||5'('S)||^, and we have the further inequality 

t+T-l 

E[Fix{ty,Ct+r)-F{x{t + ry,(t+r)\J't-i]< Ga{s)E[\\g{s)\l \ Tt-i]- 

s=t 

Applying Jensen's inequality and Assumption El we see that 

nUs)L I ^t-i] < yjE[n\\g{sy\l\Tg^i]\Ft-i] < = G. 
In conclusion, we have the first statement of the lemma: 

t+T-l 

E[F{x{ty ^t+r) - F{x{t + r); | J't-i] < Y «(^) ^ GVa(t), 

s=t 

since the sequence a{t) is non-increasing. The proof of the second statement is entirely similar, but 
we do not need to apply conditional expectations. □ 



C Proof of Proposition [T] 

Proof of Proposition [1] We construct a family of r different martingales from the summation 
in the statement of the proposition, each of which we control with high probability. Applying a 
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union bound gives us control on the deviation of the entire series. We begin by defining the random 
variables 

Zt := /(x(t - T + 1)) - F{x{t - r + 1); 6) + Ct) - /(x*), 

noting that 

T T-r+1 

Y.Zt= [f{x{t))-F{x{ty,^t+r-i) + F{x*;^t+r-i)-f{xn]- 

t=T t=l 

By defining the filtration of fj-fields = J-ri+j for j = 1, . . . ,r, we can construct a set of Doob 
martingales {Xl,X2, ■ ■ ■} for j = 1, . . . , r by making the definition 

Xj := Zri+j — E,[ZT-i+j I = Z-j-i+j — MlZri^j I J- r{i-l)+j] 

= f(x{T{i - 1) + J + 1)) - F{x{T{i - 1) + i + 

+ F{x*;U+j) - fix') - E[Zt I Tr{^-l)+J]■ 

By inspection, is measurable with respect to the cr-field A^, and E[X^ \ Al_i\ = 0. So, for 
each j, the sequence {Xj : z = 1, 2, . . .} is a martingale difference sequence adapted to the filtration 
{A{ : i = 1, 2, . . .}. Define the index set to be the indices {1, . . . , [T/t\ +1} for j <T-t [T/t\ 
and {1, . . . , [T/rJ} otherwise. With the definition of Xj and the indices we see that 

t=T j = li^X{j) t=T j = l i = l t=T 



Now we note the following important fact: by the compactness assumption (|32p and Assump- 
tion [HI the Jv(j_i)+j-measurability of f{x{T{i — 1) + j + 1)) implies 

l^il = \Zri+j - ¥.[Zri+j I Jv(j_i)+j]| < 2GR. 

This bound, coupled with the representation (]42p . shows that Ylt=T ^ sum of r different 

bounded-difference martingales plus a sum of conditional expectations that we will bound later. 
To control the martingale portion of the sum (I42p . we apply the triangle inequality, a union bound, 
and Azuma's inequality [2] to find 

v2 



^(t E > 7) ^ E > 7) ^ E-p 

^ .7=1 iexf?) ^ .7=1 ^iex(i) ^ .7=1 



since there are fewer than 2T/r terms in each of the sums Xj (by our assumption that T/2 > t) 
Substituting 7 = AGR^/Tt log(T/(5), we find 

^(E E > 4Gi?^rrlogj) < 5. 

To bound the final term E[Zj | -F(-r] in the sum (j42p . we recall from Lemma [T2] that 

3t 



Summing this bound completes the proof. □ 
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D Probabilistic Mixing 

Proof of Lemma [5] Using the definitions in the statement of the lemma, take 

I fP \^ I ^ log 7 , logi^ ^ c 



log7| |log7| |log7| 
which implies by Markov's inequality that 



/ t+r ^T^ ^ W ^ i^exp(-log|)exp(-logi^) 
c^Tv \ > nj > ej < — — < ^- exp(-c) = exp(-c) 



since 7"/l'°S7l = exp(— a) for < 7 < 1. Noting that 

for any r G N completes the proof. □ 

Proof of Lemma 1141 We use a covering number argument, which is common in uniform 
concentration inequalities in probability theory (e.g., [39]). For each t G {1, . . . ,T}, define 

et :=inf{e > : T^^{P,e) < t} . 

By the right-continuity of e 1— )• T^y{P, e), we have t^^{P, ej) < t but t^^{P^ — (5) > t for any 5 > 0. 
As a consequence, we see that for some e > to exist satisfying r^^ (P[5] , e) > r^^ (P, e) + c, it 
must be the case that 

^TV iP[s\ > ) - '^TV ^t)> C 

for some ej, where t G {1, . . . , T}. That is, we have 

P (r^v (P[s] , e) > r^y (P, e) + c for some s G {1, . . . , T} and e > e-r) 

Applying a union bound and Assumption [Dl we thus see that for any c > 0, 

P I niax sup (r^v (P[^] , e) - r^v (P, e)) > c j 

< max P (r^^ (P[,] , e*) > r^^ (P, e) + c) < exp (-c/k) . 

Setting the final equation equal to 6 and solving, we obtain c = K[log(l/(5) + 21og(T)], which is 
equivalent to the statement of the lemma. □ 
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